Prediction of Protein Solubility in Escherichia coli Using Discriminant Analysis, Logistic Regression, and Artificial Neural Network Models

Authors

  • Reese Lennarson
  • Rex Richard
  • Miguel Bagajewicz
Abstract

Recombinant DNA technology is important in the mass production of proteins for academic, medical, and industrial use, and predicting protein solubility is a significant part of it. However, the solubility of a protein overexpressed in a host organism is difficult to predict. Thus, a model capable of accurately estimating the likelihood that a protein will form insoluble inclusion bodies would be highly useful in many applications, indicating whether the protein requires chaperones to remain soluble under the conditions within the host organism. To this end, solubility data for proteins overexpressed in Escherichia coli were compiled, and properties of the proteins likely to affect solubility were identified as parameters for building solubility prediction models. In this paper, three models were constructed using discriminant analysis, logistic regression, and neural networks. Significant parameters were determined, and the efficiencies of solubility prediction for the three procedures were compared. Among the properties investigated, α-helix propensity and asparagine fraction were the most important parameters in the discriminant analysis model; for logistic regression, molecular weight, total number of hydrophobic residues, hydrophilicity index, approximate charge average, asparagine fraction, and tyrosine fraction were found to be the greatest contributors to protein solubility. For the neural network, the most important parameters included the asparagine fraction, total number of hydrophobic residues, and tyrosine fraction. The asparagine fraction was of great importance, as it was the only parameter found to be among the five most significant parameters in all three models. Post hoc evaluations of the models indicated that the discriminant analysis model was 66.5% accurate, the logistic regression model was 73.9% accurate, and the neural network model was 91.0% accurate.
For the logistic regression model, post hoc accuracies were shown to increase as predictions of solubility or insolubility neared high probabilities. A priori evaluations were used to determine how well logistic regression and the neural network would predict solubility of new proteins. The discriminant analysis was excluded from this study because its post hoc accuracy was exceedingly low. These studies showed that the logistic regression models tended to give higher prediction accuracies than neural networks for proteins not previously used in creating the respective models, but logistic regression predictions were highly skewed toward insolubility, while neural network predictions were more balanced overall.
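
As a rough illustration of how sequence-derived parameters such as those named in the abstract could feed a logistic regression score, the following minimal Python sketch may help. It is not the authors' model: the hydrophobic residue set, the function names, and every coefficient are assumptions made for illustration only, and the abstract reports neither fitted coefficients nor the sign of any effect.

```python
import math

# Assumption: the abstract does not say which residues count as hydrophobic;
# this set is one common convention, used here purely for illustration.
HYDROPHOBIC = set("AVLIMFWC")


def sequence_features(seq):
    """Compute three of the sequence parameters named in the abstract."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    return {
        "asn_fraction": seq.count("N") / n,   # asparagine fraction
        "tyr_fraction": seq.count("Y") / n,   # tyrosine fraction
        "n_hydrophobic": sum(1 for aa in seq if aa in HYDROPHOBIC),
    }


def logistic_probability(features, weights, intercept):
    """Standard logistic model: p = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    z = intercept + sum(weights[name] * features[name] for name in weights)
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical placeholder coefficients, NOT fitted values from the paper.
EXAMPLE_WEIGHTS = {"asn_fraction": -4.0, "tyr_fraction": 3.0, "n_hydrophobic": -0.01}
EXAMPLE_INTERCEPT = 0.5

feats = sequence_features("MNNYAKLV")  # toy sequence, not a real protein
p = logistic_probability(feats, EXAMPLE_WEIGHTS, EXAMPLE_INTERCEPT)
print(feats, round(p, 3))
```

In practice the coefficients would be estimated from a compiled solubility data set, and the predicted probability would be compared with a cutoff to label a protein as soluble or insoluble.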


Similar resources

Comparison of Gestational Diabetes Prediction Between Logistic Regression, Discriminant Analysis, Decision Tree and Artificial Neural Network Models

Background and Objectives: Gestational Diabetes Mellitus (GDM) is the most common metabolic disorder in pregnancy. In case of early detection, some of its complications can be prevented. The aim of this study was to investigate early prediction of GDM by logistic regression (LR), discriminant analysis (DA), decision tree (DT) and perceptron artificial neural network (ANN) and to compare these m...


Comparison of artificial neural network with logistic regression in prediction of tendency to surgical intervention in nurses

Introduction: Logistic regression is one of the modeling methods for bipartite dependent variables. On the other hand, artificial neural network is a flexible method with the least limitation. The importance of growing unnecessary beauty surgeries and the importance of prediction and classification made us consider the present study, with the aim of comparing logistic regression and artificial ...


The Comparison of Credit Risk between Artificial Neural Network and Logistic Regression Models in Tose-Taavon Bank in Guilan

One of the most important issues always facing banks and financial institutes is the issue of credit risk or the possibility of failure in the fulfillment of obligations by applicants who are receiving credit facilities. The considerable number of banks’ delayed loan payments all around the world shows the importance of this issue and the necessary consideration of this topic. Accordingly...


ARTICLE Prediction of Protein Solubility in Escherichia coli Using Logistic Regression

In this article we present a new and more accurate model for the prediction of the solubility of proteins overexpressed in the bacterium Escherichia coli. The model uses the statistical technique of logistic regression. To build this model, 32 parameters that could potentially correlate well with solubility were used. In addition, the protein database was expanded compared to those used previou...


Enhancing Efficiency of Neural Network Model in Prediction of Firms Financial Crisis Using Input Space Dimension Reduction Techniques

The main focus in this study is on data pre-processing, reduction in number of inputs or input space size reduction the purpose of which is the justified generalization of data set in smaller dimensions without losing the most significant data. In case the input space is large, the most important input variables can be identified from which insignificant variables are eliminated, or a variable ...



Publication date: 2007